-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add data split mode to DMatrix MetaInfo #8568
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need the one in learner if DMatrix contains all the information?
Hmm, if we don't declare a training parameter, how would the user specify it? I imagine if someone wants to do column-wise split, they'd do something like: param = { 'tree_method': 'gpu_hist', data_split_mode: 'col'}
xgb.train(param, ...) |
If the DMatrix contains the information the user doesn't need to specify that parameter right? |
Yeah I guess the question is whether to specify it through dmatrix or as a training parameter. dtrain = xgb.DMatrix('very-wide.txt.train', data_split_mode='col')
dtest = xgb.DMatrix('very-wide.txt.test', data_split_mode='col')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist) vs. dtrain = xgb.DMatrix('very-wide.txt.train')
dtest = xgb.DMatrix('very-wide.txt.test')
param = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist', 'data_split_mode': 'col'}
watchlist = [(dtest, 'eval'), (dtrain, 'train')]
num_round = 500
xgb.train(param, dtrain, num_round, evals=watchlist) Which one do you think it's more natural or user friendly? |
@trivialfis this is what it looks like if we take out the |
I think the DMatrix one is better. A little bit of background here, @rongou shared some potential changes with me offline for federated learning. One important factor is how we handle the data split parameter in the predictor. The original plan was to pass it as part of the learner train param or learner model param. We later thought it might be possible to make it part of the DMatrix, and all algorithms will dispatch based on the DMatrix it receives, which from my perspective is cleaner than the learner option. cc @RAMitchell . |
@trivialfis Turns out this actually solves a problem we talked about before. If the user has done some preprocessing to split the training data beforehand, now they can pass in Other than that bit of additional functionality, this PR preserves the existing behavior. Do you think we should merge it? The CI should pass now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on using DMatrix as the parameter holder. Some other questions in comments.
@trivialfis @hcho3 I think this is close to what we want. Please take another look. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks good to me. Excellent work! Some questions regarding the interface and the scope of the new parameter.
include/xgboost/c_api.h
Outdated
@@ -126,12 +126,29 @@ XGB_DLL int XGBGetGlobalConfig(char const **out_config); | |||
|
|||
/*! | |||
* \brief load a data matrix | |||
* \deprecated since 1.7.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the next big one would be 2.0 unless something major comes up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
include/xgboost/c_api.h
Outdated
* \param out a loaded data matrix | ||
* \return 0 when success, -1 when failure happens | ||
*/ | ||
XGB_DLL int XGDMatrixCreateFromFileV2(char const *config, DMatrixHandle *out); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered making it a proper API for URI? For instance FromURI
? Since users actually need to use some of the URI formats like file.txt?format=csv
. XGBoost/dmlc-core doesn't guess the format and is a source of bugs when users pass in only the file name.
Also, do you plan to introduce the need_split
with other input sources as well? If not, we can make it an URI parameter and limit its use to this function. Otherwise, we need to have an additional parameter in all language bindings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to URI.
For need_split
, we can get rid of it if we are just going to preserve the current behavior, which is to split for distributed training, no otherwise. If a user wants more flexibility, they can always use another api to load the data. For distributed training, most people are probably not using the file api anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it as a parameter.
@trivialfis finally got the CI to pass. PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the work on preparing the DMatrix for column split!
We need this information to change the behavior of the predictor when data is split column-wise.
Part of #8424